The textbook for the Data Science course series is freely available online.

Learning Objectives

Course Overview

Section 1: Introduction to Data Visualization and Distributions

You will get started with data visualization and distributions in R.

Section 2: Introduction to ggplot2

You will learn how to use ggplot2 to create plots.

Section 3: Summarizing with dplyr

You will learn how to summarize data using dplyr.

Section 4: Gapminder

You will see examples of ggplot2 and dplyr in action with the Gapminder dataset.

Section 5: Data Visualization Principles

You will learn general principles to guide you in developing effective data visualizations.

Section 1 Overview

Section 1 introduces you to Data Visualization and Distributions.

After completing Section 1, you will:

Introduction to Data Visualization

The textbook for this section is available here

Key points

Code

library(dslabs)
data(murders)
head(murders)
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

Introduction to Distributions

The textbook for this section is available here

Key points

Data Types

The textbook for this section is available here

Key points

Assessment - Data Types

  1. The type of data we are working with will often influence the data visualization technique we use.

We will be working with two types of variables: categorical and numeric. Each can be divided into two other groups: categorical can be ordinal or not, whereas numerical variables can be discrete or continuous.

We will review data types using some of the examples provided in the dslabs package. For example, the heights dataset.

library(dslabs)
data(heights)
data(heights)
names(heights)
## [1] "sex"    "height"
  1. We saw that sex is the first variable. We know what values are represented by this variable and can confirm this by looking at the first few entires:
head(heights)
##      sex height
## 1   Male     75
## 2   Male     70
## 3   Male     68
## 4   Male     74
## 5   Male     61
## 6 Female     65

What data type is the sex variable?

  1. Keep in mind that discrete numeric data can be considered ordinal.

Although this is technically true, we usually reserve the term ordinal data for variables belonging to a small number of different groups, with each group having many members.

The height variable could be ordinal if, for example, we report a small number of values such as short, medium, and tall. Let’s explore how many unique values are used by the heights variable. For this we can use the unique function:

x <- c(3, 3, 3, 3, 4, 4, 2)
unique(x)
x <- heights$height
length(unique(x))
## [1] 139
  1. One of the useful outputs of data visualization is that we can learn about the distribution of variables.

For categorical data we can construct this distribution by simply computing the frequency of each unique value. This can be done with the function table. Here is an example:

x <- c(3, 3, 3, 3, 4, 4, 2)
table(x)
x <- heights$height
tab <- table(x)
  1. To see why treating the reported heights as an ordinal value is not useful in practice we note how many values are reported only once.

In the previous exercise we computed the variable tab which reports the number of times each unique value appears. For values reported only once tab will be 1. Use logicals and the function sum to count the number of times this happens.

tab <- table(heights$height)
sum(tab==1)
## [1] 63
  1. Since there are a finite number of reported heights and technically the height can be considered ordinal, which of the following is true:

Describe Heights to ET

The textbook for this section is available:

Key points

Code

# load the dataset
library(dslabs)
data(heights)
# make a table of category proportions
prop.table(table(heights$sex))
## 
##    Female      Male 
## 0.2266667 0.7733333

Smooth Density Plots

The textbook for this section is available here

Key points

A further note on histograms: note that the choice of binwidth has a determinative effect on shape. There is no “true” choice for binwidth, and you can sometimes gain insights into the data by experimenting with binwidths.

Assessment - Distributions

  1. You may have noticed that numerical data is often summarized with the average value.

For example, the quality of a high school is sometimes summarized with one number: the average score on a standardized test. Occasionally, a second number is reported: the standard deviation. So, for example, you might read a report stating that scores were 680 plus or minus 50 (the standard deviation). The report has summarized an entire vector of scores with with just two numbers. Is this appropriate? Is there any important piece of information that we are missing by only looking at this summary rather than the entire list? We are going to learn when these 2 numbers are enough and when we need more elaborate summaries and plots to describe the data.

Our first data visualization building block is learning to summarize lists of factors or numeric vectors. The most basic statistical summary of a list of objects or numbers is its distribution. Once a vector has been summarized as distribution, there are several data visualization techniques to effectively relay this information. In later assessments we will practice to write code for data visualization. Here we start with some multiple choice questions to test your understanding of distributions and related basic plots.

In the murders dataset, the region is a categorical variable and on the right you can see its distribution (Figure 1). To the closest 5%, what proportion of the states are in the North Central region?

Region vs. Proportion

  1. In the murders dataset, the region is a categorical variable and to the right is its distribution.

Which of the following is true:

  1. The plot (Figure 2) shows the eCDF for male heights.

Based on the plot, what percentage of males are shorter than 75 inches?

eCDF for male heights

  1. To the closest inch, what height m has the property that 1/2 of the male students are taller than m and 1/2 are shorter?
  1. Here is an eCDF of the murder rates across states (Figure 3).

Knowing that there are 51 states (counting DC) and based on this plot, how many states have murder rates larger than 10 per 100,000 people?

eCDF of the murder rates across states

  1. Based on the eCDF above, which of the following statements are true.
  1. Here is a histogram of male heights in our heights dataset.

Based on this plot (Figure 4), how many males are between 62.5 and 65.5?

Histogram of male heights

  1. About what percentage are shorter than 60 inches?
  1. Based on this density plot (Figure 5), about what proportion of US states have populations larger than 10 million?

Density plot population

  1. Here are three density plots (Figure 6). Is it possible that they are from the same dataset?

Which of the following statements is true?

Three density plots

Normal Distribution

The textbook for this section is available here

Key points

\(Z = \frac{x - (\mu)}{\sigma}\)

Equation for the normal distribution

The normal distribution is mathematically defined by the following formula for any mean \(\mu\) and standard deviation \(\sigma\):

\(Pr(a < x < b) = \int_{a}^{b} \frac{1}{\sqrt2\pi\sigma} e^-\frac{1}{2}(\frac{x - \mu}{\sigma})^2 dx\)

Code

# define x as vector of male heights
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
index <- heights$sex=="Male"
x <- heights$height[index]

# calculate the mean and standard deviation manually
average <- sum(x)/length(x)
SD <- sqrt(sum((x - average)^2)/length(x))

# built-in mean and sd functions - note that the audio and printed values disagree
average <- mean(x)
SD <- sd(x)
c(average = average, SD = SD)
##   average        SD 
## 69.314755  3.611024
# calculate standard units
z <- scale(x)

# calculate proportion of values within 2 SD of mean
mean(abs(z) < 2)
## [1] 0.9495074

Note about the sd function: The built-in R function sd calculates the standard deviation, but it divides by length(x)-1 instead of length(x). When the length of the list is large, this difference is negligible and you can use the built-in sd function. Otherwise, you should compute \(\sigma\) by hand. For this course series, assume that you should use the sd function unless you are told not to do so.

Assessment - Normal Distribution

  1. Histograms and density plots provide excellent summaries of a distribution.

But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.

The normal distribution, also known as the bell curve and as the Gaussian distribution, is one of the most famous mathematical concepts in history. A reason for this is that approximately normal distributions occur in many situations. Examples include gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors. Often data visualization is needed to confirm that our data follows a normal distribution.

Here we focus on how the normal distribution helps us summarize data and can be useful in practice.

One way the normal distribution is useful is that it can be used to approximate the distribution of a list of numbers without having access to the entire list. We will demonstrate this with the heights dataset.

Load the height data set and create a vector x with just the male heights:

library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]

What proportion of the data is between 69 and 72 inches (taller than 69 but shorter or equal to 72)? A proportion is between 0 and 1.

x <- heights$height[heights$sex == "Male"]
mean(x > 69 & x <= 72)
## [1] 0.3337438
  1. Suppose all you know about the height data from the previous exercise is the average and the standard deviation and that its distribution is approximated by the normal distribution.

We can compute the average and standard deviation like this:

library(dslabs)
data(heights)
x <- heights$height[heights$sex=="Male"]
avg <- mean(x)
stdev <- sd(x)

Suppose you only have avg and stdev below, but no access to x, can you approximate the proportion of the data that is between 69 and 72 inches?

Given a normal distribution with a mean mu and standard deviation sigma, you can calculate the proportion of observations less than or equal to a certain value with pnorm(value, mu, sigma). Notice that this is the CDF for the normal distribution. We will learn much more about pnorm later in the course series, but you can also learn more now with ?pnorm.

x <- heights$height[heights$sex=="Male"]
avg <- mean(x)
stdev <- sd(x)
pnorm(72, avg, stdev) - pnorm(69, avg, stdev)
## [1] 0.3061779
  1. Notice that the approximation calculated in the second question is very close to the exact calculation in the first question.

The normal distribution was a useful approximation for this case. However, the approximation is not always useful. An example is for the more extreme values, often called the “tails” of the distribution. Let’s look at an example. We can compute the proportion of heights between 79 and 81.

library(dslabs)  
data(heights)
x <- heights$height[heights$sex == "Male"]  
mean(x > 79 & x <= 81)  
x <- heights$height[heights$sex == "Male"]
avg <- mean(x)
stdev <- sd(x)
exact <- mean(x > 79 & x <= 81)
approx <- pnorm(81, avg, stdev) - pnorm(79, avg, stdev)
exact
## [1] 0.004926108
approx
## [1] 0.003051617
exact/approx
## [1] 1.614261
  1. Someone asks you what percent of seven footers are in the National Basketball Association (NBA). Can you provide an estimate? Let’s try using the normal approximation to answer this question.

First, we will estimate the proportion of adult men that are 7 feet tall or taller.

Assume that the distribution of adult men in the world as normally distributed with an average of 69 inches and a standard deviation of 3 inches.

# use pnorm to calculate the proportion over 7 feet (7*12 inches)
1 - pnorm(7*12, 69, 3)
## [1] 2.866516e-07
  1. Now we have an approximation for the proportion, call it p, of men that are 7 feet tall or taller.

We know that there are about 1 billion men between the ages of 18 and 40 in the world, the age range for the NBA.

Can we use the normal distribution to estimate how many of these 1 billion men are at least seven feet tall?

p <- 1 - pnorm(7*12, 69, 3)
round(p*10^9)
## [1] 287
  1. There are about 10 National Basketball Association (NBA) players that are 7 feet tall or higher.
p <- 1 - pnorm(7*12, 69, 3)
N <- round(p*10^9)
10/N
## [1] 0.03484321
  1. In the previous exerceise we estimated the proportion of seven footers in the NBA using this simple code:
p <- 1 - pnorm(7*12, 69, 3)  
N <- round(p * 10^9)  
10/N  

Repeat the calculations performed in the previous question for Lebron James’ height: 6 feet 8 inches. There are about 150 players, instead of 10, that are at least that tall in the NBA.

## Change the solution to previous answer
p <- 1 - pnorm(7*12, 69, 3)
N <- round(p * 10^9)
10/N
## [1] 0.03484321
p <- 1 - pnorm(6*12+8, 69, 3)
N <- round(p * 10^9)
150/N
## [1] 0.001220842
  1. In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player.

What would be a fair critique of our calculations?

Assessment 4 (Quantiles, percentiles, and boxplots)

  1. Vector lengths

When analyzing data it’s often important to know the number of measurements you have for each category. Define a variable male that contains the male heights. Define a variable female that contains the female heights. Report the length of each variable.

library(dslabs)
data(heights)
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]

length(male)
## [1] 812
length(female)
## [1] 238
  1. Percentiles

Suppose we can’t make a plot and want to compare the distributions side by side. We can’t just list all the numbers. Instead, we will look at the percentiles. Create a five row table showing female_percentiles and male_percentiles with the 10th, 30th, 50th, ., 90th percentiles for each sex. Then create a data frame with these two as columns.

library(dslabs)
data(heights)
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]

male_percentiles <- quantile(male, seq(0.1, 0.9,0.2))
female_percentiles <- quantile(female, seq(0.1, 0.9,0.2))

df <- data.frame(female = female_percentiles, male = male_percentiles)
df
        female          male
        <dbl>           <dbl>
10% 61.00000    65.00000
30% 63.00000    68.00000
50% 64.98031    69.00000
70% 66.46417    71.00000
90% 69.00000    73.22751
5 rows
  1. Interpretating Boxplots - 1

Study the boxplots summarizing the distributions of populations sizes by country.

Which continent has the country with the largest population size?

index

  1. Interpretating Boxplots - 2

Study the boxplots summarizing the distributions of populations sizes by country.

Which continent has median country with the largest population?

index

  1. Interpretating Boxplots - 3

Again, look at the boxplots summarizing the distributions of populations sizes by country. To the nearest million, what is the median population size for Africa?

index

  1. Low quantiles

Examine the following boxplots and report approximately what proportion of countries in Europe have populations below 14 million:

index

  1. Interquantile Range (IQR)

Based on the boxplot, if we use a log transformation, which continent shown below has the largest interquartile range?

index

Assessment 5 (Robust Summaries With Outliers)

  1. Exploring the Galton Dataset - Average and Median

For this chapter, we will use height data collected by Francis Galton for his genetics studies. Here we just use height of the children in the dataset:

library(HistData)
data(Galton)
x <- Galton$child

Compute the average and median of these data. Note: do not assign them to a variable.

mean(x)
## [1] 68.08847
median(x)
## [1] 68.2
  1. Exploring the Galton Dataset - SD and MAD

Now for the same data compute the standard deviation and the median absolute deviation (MAD).

sd(x)
## [1] 2.517941
mad(x)
## [1] 2.9652
  1. Error impact on average

In the previous exercises we saw that the mean and median are very similar and so are the standard deviation and MAD. This is expected since the data is approximated by a normal distribution which has this propoerty.

Now suppose that suppose Galton made a mistake when entering the first value, forgetting to use the decimal point. You can imitate this error by typing:

library(HistData)
data(Galton)
x <- Galton$child
x_with_error <- x
x_with_error[1] <- x_with_error[1]*10

The data now has an outlier that the normal approximation does not account for. Let’s see how this affects the average.

Report how many inches the average grow after this mistake. Specifically, report the difference between the average of the data with the mistake x_with_error and the data without the mistake x.

mean(x_with_error)-mean(x)
## [1] 0.5983836
  1. Error impact on SD

In the previous exercise we saw how a simple mistake can result in the average of our data increasing more than half a foot, which is a large difference in practical terms. Now let’s explore the effect this outlier has on the standard deviation.

Report how many inches the SD grows after this mistake. Specifically, report the difference between the SD of the data with the mistake x_with_error and the data without the mistake x.

x_with_error <- x
x_with_error[1] <- x_with_error[1]*10

sd(x_with_error)-sd(x)
## [1] 15.6746
  1. Error impact on median

In the previous exercises we saw how one mistake can have a substantial effect on the average and the standard deviation.

Now we are going to see how the median and MAD are much more resistant to outliers. For this reason we say that they are robust summaries.

Report how many inches the median grows after the mistake. Specifically, report the difference between the median of the data with the mistake x_with_error and the data without the mistake x.

x_with_error <- x
x_with_error[1] <- x_with_error[1]*10

median(x_with_error)-median(x)
## [1] 0
  1. Error impact on MAD

We saw that the median barely changes. Now let’s see how the MAD is affected.

Report how many inches the MAD grows after the mistake. Specifically, report the difference between the MAD of the data with the mistake x_with_error and the data without the mistake x.

x_with_error <- x
x_with_error[1] <- x_with_error[1]*10

mad(x_with_error)-mad(x)
## [1] 0
  1. Usefulness of EDA

How could you use exploratory data analysis to detect that an error was made? - [ ] A. Since it is only one value out of many, we will not be able to detect this. - [ ] B. We would see an obvious shift in the distribution. - [X] C. A boxplot, histogram, or qq-plot would reveal a clear outlier. - [ ] D. A scatter plot would show high levels of measurement error.

  1. Using EDA to explore changes

We have seen how the average can be affected by outliers. But how large can this effect get? This of course depends on the size of the outlier and the size of the dataset.

To see how outliers can affect the average of a dataset, let’s write a simple function that takes the size of the outlier as input and returns the average.

Write a function called error_avg that takes a value k and returns the average of the vector x after the first entry changed to k. Show the results for k=10000 and k=-10000.

x <- Galton$child

error_avg <- function(k){
  x[1] <- k
  mean(x)
}

error_avg(10000)
## [1] 78.79784
error_avg(-10000)
## [1] 57.24612

Section 2 Overview

In Section 2, you will learn how to create data visualizations in R using ggplot2.

After completing Section 2, you will: - be able to use ggplot2 to create data visualizations in R. - be able to explain what the data component of a graph is. - be able to identify the geometry component of a graph and know when to use which type of geometry. - be able to explain what the aesthetic mapping component of a graph is. - be able to understand the scale component of a graph and select an appropriate scale component to use.

The textbook for this section is available here

Assessment 6 (ggplot2)

Start by loading the dplyr and ggplot2 library as well as the murders and heights data.

library(dplyr)
library(ggplot2)
library(dslabs)
data(heights)
data(murders)
  1. ggplot2 basics

With ggplot2 plots can be saved as objects. For example we can associate a dataset with a plot object like this

p <- ggplot(data = murders)

Because data is the first argument we don’t need to spell it out

p <- ggplot(murders)

or, if we load dplyr, we can also use the pipe:

p <- murders %>% ggplot()

Remember the pipe sends the object on the left of %>% to be the first argument for the function the right of %>%. What is class of the object p?

class(p)
## [1] "gg"     "ggplot"
  1. Printing

Remember that to print an object you can use the command print or simply type the object. For example

x <- 2
x
## [1] 2
print(x)
## [1] 2

Print the object p defined in exercise one and describe what you see. - [ ] A. Nothing happens. - [X] B. A blank slate plot. - [ ] C. A scatter plot. - [ ] D. A histogram.

  1. Pipes

Using the pipe %>%, create an object p but this time associated with the heights dataset instead of the murders dataset.

# define ggplot object called p like in the previous exercise but using a pipe 
p <- heights %>% ggplot()

# What is the class of the object p you have just created?
class(p)
## [1] "gg"     "ggplot"
  1. Layers

Now we are going to add a layers and the corresponding aesthetic mappings. For the murders data we plotted total murders versus population sizes. Explore the murders data frame to remind yourself what are the names for these two variables and select the correct answer. Hint: Look at ?murders.

  1. geom_point 1

To create the scatter plot we add a layer with geom_point. The aesthetic mappings require us to define the x-axis and y-axis variables respectively. So the code looks like this:

murders %>% ggplot(aes(x = , y = )) + geom_point()

except we have to define the two variables x and y. Fill this out with the correct variable names.

## Fill in the blanks
murders %>% ggplot(aes(x =population , y =total )) +
  geom_point()

index

  1. geom_point 1

Note that if we don’t use argument names, we can obtain the same plot by making sure we enter the variable names in the right order like this:

murders %>% ggplot(aes(population, total)) +
  geom_point()

index

Remake the plot but now with total in the x-axis and population in the y-axis.

murders %>% ggplot(aes(total,population)) +
  geom_point()

index

  1. geom_point text

If instead of points we want to add text, we can use the geom_text() or geom_label() geometries. The following code

murders %>% ggplot(aes(population, total)) + geom_label()

will give us the error message: Error: geom_label requires the following missing aesthetics: label

Why is this?

  1. Rewrite the code from the previous exercise to add the state abbreviation as the label through aes.
library(dplyr)
library(ggplot2)
library(dslabs)
data(murders)
## edit the next line to add the label
murders %>% ggplot(aes(population, total, label=abb)) +
  geom_point()+
  geom_label()

index

  1. geom_point colors

Change the color of the labels through blue. How will we do this?

  1. Rewrite the code above to make the labels blue.
murders %>% ggplot(aes(population, total,label= abb)) +
  geom_label(color='blue')

index

  1. geom_labels by region

Now suppose we want to use color to represent the different regions. In this case which of the following is most appropriate:

  1. geom_label colors

Rewrite the code above to make the label color correspond to the state’s region.

## edit this code
murders %>% ggplot(aes(population, total, label = abb,color=region)) +
  geom_label()

index

  1. Log-scale

Now we are going to change the x-axis to a log scale to account for the fact the distribution of population is skewed. Let’s start by define an object p holding the plot we have made up to now

p <- murders %>% 
  ggplot(aes(population, total, label = abb, color = region)) +
  geom_label() 

To change the y-axis to a log scale we learned about the scale_x_log10() function. Add this layer to the object p to change the scale and render the plot.

p <- murders %>% ggplot(aes(population, total, label = abb, color = region)) + geom_label()

p + scale_x_log10()

Unknown

#Repeat the previous exercise but now change both axes to be in the log scale.
p + scale_x_log10() + scale_y_log10()

Unknown-2

  1. Titles

Now edit the code above to add the title “Gun murder data” to the plot. Hint: use the ggtitle function.

p <- murders %>% ggplot(aes(population, total, label = abb, color = region)) +
  geom_label()
# add a layer to add title to the next line
p + scale_x_log10() + 
    scale_y_log10() + ggtitle("Gun murder data")

Unknown

  1. Histograms

We are going to shift our focus from the murders dataset to explore the heights dataset.

We use the geom_histogram function to make a histogram of the heights in the heights data frame. When reading the documentation for this function we see that it requires just one mapping, the values to be used for the histogram.

What is the variable containing the heights in inches in the heights data frame?

  1. A second example

We are now going to make a histogram of the heights so we will load the heights dataset. The following code has been pre-run for you to load the heights dataset:

Create a ggplot object called p using the pipe to assign the heights data to a ggplot object. Assign height to the x values through the aes function.

# define p here
p <- heights %>% ggplot(aes(height))
  1. Histograms 2

Now we are ready to add a layer to actually make the histogram.

Add a layer to the object p (created in the previous exercise) using the geom_histogram function to make the histogram.

p <- heights %>% 
  ggplot(aes(height))
## add a layer to p
p + geom_histogram()

Unknown

  1. Histogram binwidth

Note that when we run the code from the previous exercise we get the following warning:

stat_bin() using bins = 30. Pick better value with binwidth.

Use the binwidth argument to change the histogram made in the previous exercise to use bins of size 1 inch.

p <- heights %>% 
  ggplot(aes(height))
## add the geom_histogram layer but with the requested argument
p + geom_histogram(binwidth=1)

Unknown

  1. Smooth density plot

Now instead of a histogram we are going to make a smooth density plot. In this case, we will not make an object p. Instead we will render the plot using a single line of code. In the previous exercise, we could have created a histogram using one line of code like this:

heights %>% 
  ggplot(aes(height)) +
  geom_histogram()

Unknown

Now instead of geom_histogram we will use geom_density to create a smooth density plot. Add the appropriate layer to create a smooth density plot of heights.

## add the correct layer using +
heights %>% 
  ggplot(aes(height)) +
  geom_density()

Unknown

  1. Two smooth density plots

Now we are going to make density plots for males and females separately. We can do this using the group argument within the aes mapping. Because each point will be assigned to a different density depending on a variable from the dataset, we need to map within aes. Create separte smooth density plots for males and females by defining group by sex.

## add the group argument then a layer with +
heights %>% 
  ggplot(aes(height,group = sex)) + geom_density()

index

  1. Two smooth density plots 2

In the previous exercise we made the two density plots, one for each sex, using:

heights %>% 
  ggplot(aes(height, group = sex)) + 
  geom_density()

index

We can also assign groups through the color or fill argument. For example, if you type color = sex ggplot knows you want a different color for each sex. So two densities must be drawn. You can therefore skip the group = sex mapping. Using color has the added benefit that it uses color to distinguish the groups. Change the density plots from the previous exercise to add color.

## edit the next line to use color instead of group then add a density layer
heights %>% 
  ggplot(aes(height, color = sex))+
  geom_density()

index

  1. Two smooth density plots 3

We can also assign groups using the fill argument. When using the geom_density geometry, color creates a colored line for the smooth density plot while fill colors in the area under the curve.

We can see what this looks like by running the following code:

heights %>% 
  ggplot(aes(height, fill = sex)) + 
  geom_density() 

index

However, here the second density is drawn over the other. We can change this by using something called alpha blending. Set the alpha parameter to 0.2 in the geom_density function to make this change.

heights %>% 
  ggplot(aes(height, fill = sex)) + 
  geom_density(alpha=0.2) 

index

Section 3 Overview

Section 3 introduces you to summarizing with dplyr.

After completing Section 3, you will: - understand the importance of summarizing data in exploratory data analysis. - be able to use the “summarize” verb in dplyr to facilitate summarizing data. - be able to use the “group_by” verb in dplyr to facilitate summarizing data. - be able to access values using the dot placeholder. - be able to use “arrange” to examine data after sorting.

The textbook for this section is available here

Assessment 7 (Summarizing with dplyr)

Practice Exercise. National Center for Health Statistics

To practice our dplyr skills we will be working with data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960’s.

Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and then they complete the health examination component of the survey. Part of this dataset is made available via the NHANES package which can be loaded this way:

library(NHANES)
data(NHANES)

The NHANES data has many missing values. Remember that the main summarization function in R will return NA if any of the entries of the input vector is an NA. Here is an example:

library(dslabs)
data(na_example)
mean(na_example)
## [1] NA
sd(na_example)
## [1] NA

To ignore the NAs, we can use the na.rm argument:

mean(na_example, na.rm = TRUE)
## [1] 2.301754
sd(na_example, na.rm = TRUE)
## [1] 1.22338
  1. Blood pressure 1

Let’s explore the NHANES data. We will be exploring blood pressure in this dataset.

First let’s select a group to set the standard. We will use 20-29 year old females. Note that the category is coded with 20-29, with a space in front of the 20! The AgeDecade is a categorical variable with these ages.

To know if someone is female, you can look at the Gender variable.

Filter the NHANES dataset so that only 20-29 year old females are included and assign this new data frame to the object tab. Use the pipe to apply the function filter, with the appropriate logicals, to NHANES. Remember that this age group is coded with 20-29, which includes a space. You can use head to explore the NHANES table to construct the correct call to filter.

library(dplyr)
library(NHANES)
data(NHANES)
## fill in what is needed
tab <- NHANES %>% filter(AgeDecade==" 20-29" & Gender=="female")
  1. Blood pressure 2

Now we will compute the average and standard deviation for the subgroup we defined in the previous exercise (20-29 year old females), which we will use reference for what is typical.

You will determine the average and standard deviation of systolic blood pressure, which are stored in the BPSysAve variable in the NHANES dataset.

Complete the line of code to save the average and standard deviation of systolic blood pressure as average and standard_deviation to a variable called ref. Use the summarize function after filtering for 20-29 year old females and connect the results using the pipe %>%. When doing this remember there are NAs in the data!

library(dplyr)
library(NHANES)
data(NHANES)
## complete this line of code.
ref <- NHANES %>% filter(AgeDecade == " 20-29" & Gender == "female") %>% summarize(average=mean(BPSysAve,na.rm=TRUE), standard_deviation=sd(BPSysAve,na.rm=TRUE))
  1. Summarizing averages

Now we will repeat the exercise and generate only the average blood pressure for 20-29 year old females. For this exercise, you should review how to use the place holder . in dplyr.

Modify the line of sample code to assign the average to a numeric variable called ref_avg.

library(dplyr)
library(NHANES)
data(NHANES)
## modify the code we wrote for previous exercise.
ref_avg <- NHANES %>%
  filter(AgeDecade == " 20-29" & Gender == "female") %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE), 
            standard_deviation = sd(BPSysAve, na.rm=TRUE)) %>% .$average
  1. Min and max

Let’s continue practicing by calculating two other data summaries: the minimum and the maximum.

Again we will do it for the BPSysAve variable and the group of 20-29 year old females.

Report the min and max values for the same group as in the previous exercises. Use filter and summarize connected by the pipe %>% again. The functions min and max can be used to get the values you want. Within summarize, save the min and max of systolic blood pressure as min and max.

library(dplyr)
library(NHANES)
data(NHANES)
## complete the line
NHANES %>%
      filter(AgeDecade == " 20-29"  & Gender == "female") %>% summarize(min=min(BPSysAve,na.rm=TRUE),max=max(BPSysAve,na.rm=TRUE))
min     max
<int>   <int>
84  179
1 row
  1. group_by

Now let’s practice using the group_by function.

What we are about to do is a very common operation in data science: you will split a data table into groups and then compute summary statistics for each group.

We will compute the average and standard deviation of systolic blood pressure for females for each age group separately. Remember that the age groups are contained in AgeDecade.

Use the functions filter, group_by, summarize, and the pipe %>% to compute the average and standard deviation of systolic blood pressure for females for each age group separately.

Within summarize, save the average and standard deviation of systolic blood pressure (BPSysAve) as average and standard_deviation.

library(dplyr)
library(NHANES)
data(NHANES)
##complete the line with group_by and summarize
NHANES %>%
      filter(Gender == "female") %>% group_by(AgeDecade) %>% summarize(average=mean(BPSysAve,na.rm=TRUE),standard_deviation = sd(BPSysAve,na.rm=TRUE))
AgeDecade       average      standard_deviation
<fctr>          <dbl>        <dbl>
0-9         99.95041     9.071798
10-19           104.27466    9.461431
20-29           108.42243    10.146681
30-39           111.25512    12.314790
40-49           115.49385    14.530054
50-59           121.84245    16.179333
60-69           127.17787    17.125713
70+         133.51652    19.841781
NA          141.54839    22.908521
9 rows
  1. group_by example 2

Now let’s practice using group_by some more. We are going to repeat the previous exercise of calculating the average and standard deviation of systolic blood pressure, but for males instead of females.

This time we will not provide much sample code. You are on your own!

Calculate the average and standard deviation of systolic blood pressure for males for each age group separately using the same methods as in the previous exercise.

library(dplyr)
library(NHANES)
data(NHANES)
NHANES %>%
      filter(Gender == "male") %>% group_by(AgeDecade) %>% summarize(average=mean(BPSysAve,na.rm=TRUE),standard_deviation = sd(BPSysAve,na.rm=TRUE))
AgeDecade       average      standard_deviation
<fctr>          <dbl>        <dbl>
0-9         97.41912     8.317367
10-19           109.59789    11.227769
20-29           117.85084    11.274795
30-39           119.40063    12.306656
40-49           120.78390    13.968338
50-59           125.75000    17.760536
60-69           126.88578    17.478117
70+         130.20172    18.657475
NA          136.40000    23.534731
9 rows
  1. group_by example 3

We can actually combine both of these summaries into a single line of code. This is because group_by permits us to group by more than one variable.

We can use group_by(AgeDecade, Gender) to group by both age decades and gender.

Create a single summary table for the average and standard deviation of systolic blood pressure using group_by(AgeDecade, Gender).

Note that we no longer have to filter!

Your code within summarize should remain the same as in the previous exercises.

library(NHANES)
data(NHANES)
NHANES %>% group_by(AgeDecade, Gender) %>% summarize(average=mean(BPSysAve,na.rm=TRUE),standard_deviation = sd(BPSysAve,na.rm=TRUE))
AgeDecade    Gender     average      standard_deviation
<fctr>       <fctr>     <dbl>        <dbl>
0-9      female 99.95041     9.071798
0-9      male   97.41912     8.317367
10-19        female 104.27466    9.461431
10-19        male   109.59789    11.227769
20-29        female 108.42243    10.146681
20-29        male   117.85084    11.274795
30-39        female 111.25512    12.314790
30-39        male   119.40063    12.306656
40-49        female 115.49385    14.530054
40-49        male   120.78390    13.968338
1-10 of 18 rows
  1. Arrange

Now we are going to explore differences in systolic blood pressure across races, as reported in the Race1 variable.

We will learn to use the arrange function to order the outcome acording to one variable.

Note that this function can be used to order any table by a given outcome. Here is an example that arranges by systolic blood pressure.

NHANES %>% arrange(BPSysAve)
ID        SurveyYr    Gender     Age        AgeDecade     AgeMonths    Race1      Race3       Education
<int>     <fctr>      <fctr>     <int>      <fctr>        <int>        <fctr>     <fctr>      <fctr>
53661     2009_10     male   12     10-19     145          Mexican    NA          NA    
58821     2009_10     female     51     50-59     613          White      NA          9 - 11th Grade    
67253     2011_12     female     10     10-19     NA           White      White       NA    
51893     2009_10     female     70     70+           843          White      NA          9 - 11th Grade    
54375     2009_10     male   76     70+           912          White      NA          High School   
54375     2009_10     male   76     70+           912          White      NA          High School   
58941     2009_10     male   8      0-9           104          Mexican    NA          NA    
67616     2011_12     male   63     60-69     NA           White      White       9 - 11th Grade    
67616     2011_12     male   63     60-69     NA           White      White       9 - 11th Grade    
67616     2011_12     male   63     60-69     NA           White      White       9 - 11th Grade    
...
1-10 of 10,000 rows | 1-9 of 76 columns

If we want it in descending order we can use the desc function like this:

NHANES %>% arrange(desc(BPSysAve))
ID        SurveyYr    Gender     Age        AgeDecade     AgeMonths    Race1      Race3       Education
<int>     <fctr>      <fctr>     <int>      <fctr>        <int>        <fctr>     <fctr>      <fctr>
55311     2009_10     female     55     50-59     671          Hispanic   NA          Some College  
67957     2011_12     male   50     50-59     NA           Black      Black       9 - 11th Grade    
67957     2011_12     male   50     50-59     NA           Black      Black       9 - 11th Grade    
62727     2011_12     female     80     NA            NA           White      White       Some College  
62727     2011_12     female     80     NA            NA           White      White       Some College  
53371     2009_10     male   68     60-69     817          White      NA          9 - 11th Grade    
58276     2009_10     female     80     NA            NA           White      NA          College Grad  
65475     2011_12     female     44     40-49     NA           Black      Black       High School   
60848     2009_10     male   80     NA            NA           Other      NA          College Grad  
68301     2011_12     male   59     50-59     NA           White      White       High School   
...
1-10 of 10,000 rows | 1-9 of 76 columns

In this example, we will compare systolic blood pressure across values of the Race1 variable for males between the ages of 40-49.

Compute the average and standard deviation for each value of Race1 for males in the age decade 40-49. Order the resulting table from lowest to highest average systolic blood pressure. Use the functions filter, group_by, summarize, arrange, and the pipe %>% to do this in one line of code. Within summarize, save the average and standard deviation of systolic blood pressure as average and standard_deviation.

library(dplyr)
library(NHANES)
data(NHANES)
NHANES %>%
      filter(AgeDecade ==" 40-49" & Gender == "male")  %>% group_by(Race1) %>% summarize(average=mean(BPSysAve,na.rm=TRUE),standard_deviation = sd(BPSysAve,na.rm=TRUE))%>% arrange(average)
Race1       average        standard_deviation
<fctr>      <dbl>          <dbl>
White       119.9188       13.42355
Other       120.4000       16.20241
Hispanic    121.6098       11.06770
Mexican     121.8500       13.93756
Black       125.8387       17.06707
5 rows

Section 4 Overview

In Section 4, you will look at a case study involving data from the Gapminder Foundation about trends in world health and economics.

After completing Section 4, you will: - understand how Hans Rosling and the Gapminder Foundation use effective data visualization to convey data-based trends. - be able to apply the ggplot2 techniques from the previous section to answer questions using data. - understand how fixed scales across plots can ease comparisons. - be able to modify graphs to improve data visualization.

The textbook for this section is available here

Assessment 8 (Exploring the Gapminder Dataset)

  1. Life expectancy vs fertility - part 1

The Gapminder Foundation is a non-profit organization based in Sweden that promotes global development through the use of statistics that can help reduce misconceptions about global development.

library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)
head(gapminder)
   country             year       infant_mortality      life_expectancy     fertility     population
   <fctr>              <int>      <dbl>                 <dbl>               <dbl>         <dbl>
1  Albania         1960   115.40            62.87               6.19      1636054   
2  Algeria         1960   148.20            47.50               7.65      11124892  
3  Angola          1960   208.00            35.98               7.32      5270844   
4  Antigua and Barbuda 1960   NA                    62.97               4.43      54681 
5  Argentina           1960   59.87                 65.39               3.11      20619075  
6  Armenia         1960   NA                    66.86               4.55      1867396   
6 rows | 1-7 of 10 columns
## fill out the missing parts in filter and aes
gapminder %>% filter(continent=="Africa" & year=="2012") %>%
  ggplot(aes(fertility,life_expectancy)) +
  geom_point()

index

  1. Life expectancy vs fertility - part 2 - coloring your plot

Note that there is quite a bit of variability in life expectancy and fertility with some African countries having very high life expectancies. There also appear to be three clusters in the plot.

Remake the plot from the previous exercises but this time use color to dinstinguish the different regions of Africa to see if this explains the clusters. Remember that you can explore the gapminder data to see how the regions of Africa are labeled in the dataframe!

library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)

gapminder %>% filter(continent=="Africa" & year=="2012") %>%
  ggplot(aes(fertility,life_expectancy, color=region)) +
  geom_point()

index

  1. Life expectancy vs fertility - part 3 - selecting country and region

While many of the countries in the high life expectancy/low fertility cluster are from Northern Africa, three countries are not.

library(dplyr)
library(dslabs)
data(gapminder)
head(gapminder)
   country             year       infant_mortality      life_expectancy     fertility     population
   <fctr>              <int>      <dbl>                 <dbl>               <dbl>         <dbl>
1  Albania         1960   115.40            62.87               6.19      1636054   
2  Algeria         1960   148.20            47.50               7.65      11124892  
3  Angola          1960   208.00            35.98               7.32      5270844   
4  Antigua and Barbuda 1960   NA                    62.97               4.43      54681 
5  Argentina           1960   59.87                 65.39               3.11      20619075  
6  Armenia         1960   NA                    66.86               4.55      1867396   
6 rows | 1-7 of 10 columns
df <- gapminder %>% filter(continent=="Africa" & year=="2012" & fertility <=3 & life_expectancy>=70) %>%
  select(country,region)
  1. Life expectancy and the Vietnam War - part 1

The Vietnam War lasted from 1955 to 1975. Do the data support war having a negative effect on life expectancy? We will create a time series plot that covers the period from 1960 to 2010 of life expectancy for Vietnam and the United States, using color to distinguish the two countries. In this start we start the analysis by generating a table.

library(dplyr)
library(dslabs)
data(gapminder)

head(gapminder)
   country             year       infant_mortality      life_expectancy     fertility     population
   <fctr>              <int>      <dbl>                 <dbl>               <dbl>         <dbl>
1  Albania         1960   115.40            62.87               6.19      1636054   
2  Algeria         1960   148.20            47.50               7.65      11124892  
3  Angola          1960   208.00            35.98               7.32      5270844   
4  Antigua and Barbuda 1960   NA                    62.97               4.43      54681 
5  Argentina           1960   59.87                 65.39               3.11      20619075  
6  Armenia         1960   NA                    66.86               4.55      1867396   
6 rows | 1-7 of 10 columns
tab <- gapminder %>% filter(year>=1960 & year<=2010 & country%in%c("Vietnam","United States"))
  1. Life expectancy and the Vietnam War - part 2

Now that you have created the data table in Exercise 4, it is time to plot the data for the two countries.

p <- tab %>% ggplot(aes(year,life_expectancy,color=country)) + geom_line()
p

index

  1. Life expectancy in Cambodia

Cambodia was also involved in this conflict and, after the war, Pol Pot and his communist Khmer Rouge took control and ruled Cambodia from 1975 to 1979. He is considered one of the most brutal dictators in history. Do the data support this claim?

Use a single line of code to create a time series plot from 1960 to 2010 of life expectancy vs year for Cambodia.

library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)

gapminder %>% filter(year>=1960 & year <= 2010 & country=="Cambodia") %>% ggplot(aes(year,life_expectancy)) + geom_line()

index

  1. Dollars per day - part 1

Now we are going to calculate and plot dollars per day for African countries in 2010 using GDP data.

In the first part of this analysis, we will create the dollars per day variable. - Use mutate to create a dollars_per_day variable, which is defined as gdp/population/365. - Create the dollars_per_day variable for African countries for the year 2010. - Remove any NA values. - Save the mutated dataset as daydollars.

library(dplyr)
library(dslabs)
data(gapminder)
daydollars <- gapminder %>% mutate(dollars_per_day=gdp/population/365)%>% filter(year==2010 & continent=="Africa" & !is.na(dollars_per_day))
  1. Dollars per day - part 2

Now we are going to calculate and plot dollars per day for African countries in 2010 using GDP data.

In the second part of this analysis, we will plot the smooth density plot using a log (base 2) x axis. - The dataset including the dollars_per_day variable is preloaded as daydollars. - Create a smooth density plot of dollars per day from daydollars. - Use a log (base 2) scale for the x axis.

daydollars %>% ggplot(aes(dollars_per_day)) + geom_density() + scale_x_continuous(trans="log2")

index

  1. Dollars per day - part 3 - multiple density plots

Now we are going to combine the plotting tools we have used in the past two exercises to create density plots for multiple years. - Create the dollars_per_day variable as in Exercise 7, but for African countries in the years 1970 and 2010 this time. - Make sure you remove any NA values. - Create a smooth density plot of dollars per day for 1970 and 2010 using a log (base 2) scale for the x axis. - Use facet_grid to show a different density plot for 1970 and 2010.

library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)

daydollars <- gapminder %>% mutate(dollars_per_day=gdp/population/365)%>% filter(year %in% c(1970,2010) & continent=="Africa" & !is.na(dollars_per_day))

daydollars %>% ggplot(aes(dollars_per_day)) + geom_density() + scale_x_continuous(trans='log2') + facet_grid(.~year)

index

  1. Dollars per day - part 4 - stacked histograms

Now we are going to edit the code from Exercise 9 to show stacked histograms of each region in Africa.

Much of the code will be the same as in Exercise 9: - Create the dollars_per_day variable as in Exercise 7, but for African countries in the years 1970 and 2010 this time. - Make sure you remove any NA values. - Create a smooth density plot of dollars per day for 1970 and 2010 using a log (base 2) scale for the x axis. - Use facet_grid to show a different density plot for 1970 and 2010. - Make sure the densities are smooth by using bw = 0.5. - Use the fill and position arguments where appropriate to create the stacked histograms of each region.

library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)

daydollars <- gapminder %>% mutate(dollars_per_day=gdp/population/365)%>% filter(year %in% c(1970,2010) & continent=="Africa" & !is.na(dollars_per_day))

daydollars %>% ggplot(aes(dollars_per_day,fill = region)) + geom_density(bw=0.5,position='stack') + scale_x_continuous(trans='log2') + facet_grid(.~year)

index

  1. Infant mortality scatter plot - part 1

We are going to continue looking at patterns in the gapminder dataset by plotting infant mortality rates versus dollars per day for African countries. - Generate dollars_per_day using mutate and filter for the year 2010 for African countries. - Remember to remove NA values. - Store the mutated dataset in gapminder_Africa_2010. - Make a scatter plot of infant_mortaility versus dollars_per_day for countries in the African continent. - Use color to denote the different regions of Africa.

library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)
gapminder_Africa_2010 <- daydollars <- gapminder %>% mutate(dollars_per_day=gdp/population/365)%>% filter(year %in% c(2010) & continent=="Africa" & !is.na(dollars_per_day))

# now make the scatter plot

gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day,infant_mortality,color = region)) + geom_point()

index

  1. Infant mortality scatter plot - part 2 - logarithmic axis

Now we are going to transform the x axis of the plot from the previous exercise. - The mutated dataset is preloaded as gapminder_Africa_2010. - As in the previous exercise, make a scatter plot of infant_mortaility versus dollars_per_day for countries in the African continent. - As in the previous exercise, use color to denote the different regions of Africa. - Transform the x axis to be in the log (base 2) scale.

gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day,infant_mortality, color=region)) + geom_point() + scale_x_continuous(trans='log2')

index

  1. Infant mortality scatter plot - part 3 - adding labels

Note that there is a large variation in infant mortality and dollars per day among African countries.

As an example, one country has infant mortality rates of less than 20 per 1000 and dollars per day of 16, while another country has infant mortality rates over 10% and dollars per day of about 1.

In this exercise, we will remake the plot from Exercise 12 with country names instead of points so we can identify which countries are which. - The mutated dataset is preloaded as gapminder_Africa_2010. - As in the previous exercise, make a scatter plot of infant_mortaility versus dollars_per_day for countries in the African continent. - As in the previous exercise, use color to denote the different regions of Africa. - As in the previous exercise, transform the x axis to be in the log (base 2) scale. - Add a layer to display country names instead of points.

gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day,infant_mortality, color=region,label = country)) + geom_point() + scale_x_continuous(trans='log2') +
    geom_text()

index

  1. Infant mortality scatter plot - part 4 - comparison of scatter plots

Now we are going to look at changes in the infant mortality and dollars per day patterns African countries between 1970 and 2010. - Generate dollars_per_day using mutate and filter for the years 1970 and 2010 for African countries. - Remember to remove NA values. - As in the previous exercise, make a scatter plot of infant_mortaility versus dollars_per_day for countries in the African continent. - As in the previous exercise, use color to denote the different regions of Africa. - As in the previous exercise, transform the x axis to be in the log (base 2) scale. - As in the previous exercise, add a layer to display country names instead of points. - Use facet_grid to show different plots for 1970 and 2010.

library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)
gapminder  %>%
    filter(continent == "Africa" & year %in% c(1970,2010) & !is.na(gdp) & !is.na(year) & !is.na(infant_mortality)) %>%
    mutate(dollars_per_day = gdp/population/365) %>%
    ggplot(aes(dollars_per_day, infant_mortality, color = region,label = country)) +
    geom_point() +
    scale_x_continuous(trans = "log2") +
    geom_text() +
    facet_grid(year~.)

index

Section 5 Overview

Section 5 covers some general principles that can serve as guides for effective data visualization.

After completing Section 5, you will: - understand basic principles of effective data visualization. - understand the importance of keeping your goal in mind when deciding on a visualization approach. - understand principles for encoding data, including position, aligned lengths, angles, area, brightness, and color hue. - know when to include the number zero in visualizations. - be able to use techniques to ease comparisons, such as using common axes, putting visual cues to be compared adjacent to one another, and using color effectively.

The textbook for this section is available here

Assessment 9 (Data Visualization Principles, Part 1)

1: Customizing plots - Pie charts

Pie charts are appropriate: - [ ] A. When we want to display percentages. - [ ] B. When ggplot2 is not available. - [ ] C. When I am in a bakery. - [X] D. Never. Barplots and tables are always better.

  1. Customizing plots - What’s wrong?

What is the problem with this plot?

index

  1. Customizing plots - What’s wrong 2?.

Take a look at the following two plots. They show the same information: rates of measles by state in the United States for 1928.

index

Assessment 10 (Data Visualization Principles, Part 2)

1: Customizing plots - watch and learn

To make the plot on the right in the exercise from the last set of assessments, we had to reorder the levels of the states’ variables. - Redefine the state object so that the levels are re-ordered by rate. - Print the new object state and its levels so you can see that the vector is now re-ordered by the levels.

library(dplyr)
library(ggplot2)
library(dslabs)
dat <- us_contagious_diseases %>%
filter(year == 1967 & disease=="Measles" & !is.na(population)) %>% mutate(rate = count / population * 10000 * 52 / weeks_reporting)
state <- dat$state 
rate <- dat$count/(dat$population/10000)*(52/dat$weeks_reporting)

state <- reorder(state,rate)
print(state)
##  [1] Alabama              Alaska               Arizona             
##  [4] Arkansas             California           Colorado            
##  [7] Connecticut          Delaware             District Of Columbia
## [10] Florida              Georgia              Hawaii              
## [13] Idaho                Illinois             Indiana             
## [16] Iowa                 Kansas               Kentucky            
## [19] Louisiana            Maine                Maryland            
## [22] Massachusetts        Michigan             Minnesota           
## [25] Mississippi          Missouri             Montana             
## [28] Nebraska             Nevada               New Hampshire       
## [31] New Jersey           New Mexico           New York            
## [34] North Carolina       North Dakota         Ohio                
## [37] Oklahoma             Oregon               Pennsylvania        
## [40] Rhode Island         South Carolina       South Dakota        
## [43] Tennessee            Texas                Utah                
## [46] Vermont              Virginia             Washington          
## [49] West Virginia        Wisconsin            Wyoming             
## attr(,"scores")
##              Alabama               Alaska              Arizona 
##           4.16107582           5.46389893           6.32695891 
##             Arkansas           California             Colorado 
##           6.87899954           2.79313560           7.96331905 
##          Connecticut             Delaware District Of Columbia 
##           0.36986840           1.13098183           0.35873614 
##              Florida              Georgia               Hawaii 
##           2.89358806           0.09987991           2.50173748 
##                Idaho             Illinois              Indiana 
##           6.03115170           1.20115480           1.34027323 
##                 Iowa               Kansas             Kentucky 
##           2.94948911           0.66386422           4.74576011 
##            Louisiana                Maine             Maryland 
##           0.46088071           2.57520433           0.49922233 
##        Massachusetts             Michigan            Minnesota 
##           0.74762338           1.33466700           0.37722410 
##          Mississippi             Missouri              Montana 
##           3.11366532           0.75696354           5.00433320 
##             Nebraska               Nevada        New Hampshire 
##           3.64389801           6.43683882           0.47181511 
##           New Jersey           New Mexico             New York 
##           0.88414264           6.15969926           0.66849058 
##       North Carolina         North Dakota                 Ohio 
##           1.92529764          14.48024642           1.16382241 
##             Oklahoma               Oregon         Pennsylvania 
##           3.27496900           8.75036439           0.67687303 
##         Rhode Island       South Carolina         South Dakota 
##           0.68207448           2.10412531           0.90289534 
##            Tennessee                Texas                 Utah 
##           5.47344506          12.49773953           4.03005836 
##              Vermont             Virginia           Washington 
##           1.00970314           5.28270939          17.65180349 
##        West Virginia            Wisconsin              Wyoming 
##           8.59456463           4.96246019           6.97303449 
## 51 Levels: Georgia District Of Columbia Connecticut ... Washington
levels(state)
##  [1] "Georgia"              "District Of Columbia" "Connecticut"         
##  [4] "Minnesota"            "Louisiana"            "New Hampshire"       
##  [7] "Maryland"             "Kansas"               "New York"            
## [10] "Pennsylvania"         "Rhode Island"         "Massachusetts"       
## [13] "Missouri"             "New Jersey"           "South Dakota"        
## [16] "Vermont"              "Delaware"             "Ohio"                
## [19] "Illinois"             "Michigan"             "Indiana"             
## [22] "North Carolina"       "South Carolina"       "Hawaii"              
## [25] "Maine"                "California"           "Florida"             
## [28] "Iowa"                 "Mississippi"          "Oklahoma"            
## [31] "Nebraska"             "Utah"                 "Alabama"             
## [34] "Kentucky"             "Wisconsin"            "Montana"             
## [37] "Virginia"             "Alaska"               "Tennessee"           
## [40] "Idaho"                "New Mexico"           "Arizona"             
## [43] "Nevada"               "Arkansas"             "Wyoming"             
## [46] "Colorado"             "West Virginia"        "Oregon"              
## [49] "Texas"                "North Dakota"         "Washington"
  1. Customizing plots - redefining

Now we are going to customize this plot a little more by creating a rate variable and reordering by that variable instead. - Add a single line of code to the definition of the dat table that uses mutate to reorder the states by the rate variable. - The sample code provided will then create a bar plot using the newly defined dat.

library(dplyr)
library(ggplot2)
library(dslabs)
data(us_contagious_diseases)
dat <- us_contagious_diseases %>% filter(year == 1967 & disease=="Measles" & count>0 & !is.na(population)) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>% mutate(state = reorder(state, rate))
dat %>% ggplot(aes(state, rate)) +
  geom_bar(stat="identity") +
  coord_flip()

index

  1. Showing the data and customizing plots

Say we are interested in comparing gun homicide rates across regions of the US. We see this plot:

library(dplyr)
library(ggplot2)
library(dslabs)
data("murders")
murders %>% mutate(rate = total/population*100000) %>%
  group_by(region) %>%
  summarize(avg = mean(rate)) %>%
  mutate(region = factor(region)) %>%
  ggplot(aes(region, avg)) +
  geom_bar(stat="identity") +
  ylab("Murder Rate Average")

index

and decide to move to a state in the western region. What is the main problem with this interpretaion? - [ ] A. The categories are ordered alphabetically. - [ ] B. The graph does not show standard errors. - [X] C. It does not show all the data. We do not see the variability within a region and it’s possible that the safest states are not in the West. - [ ] D. The Northeast has the lowest average.

  1. Making a box plot

To further investigate whether moving to the western region is a wise decision, let’s make a box plot of murder rates by region, showing all points. - Make a box plot of the murder rates by region. - Order the regions by their median murder rate. - Show all of the points on the box plot.

library(dplyr)
library(ggplot2)
library(dslabs)
data("murders")
murders %>% mutate(rate = total/population*100000) %>%
  mutate(region=reorder(region, rate, FUN=median)) %>%
  ggplot(aes(region, rate)) +
  geom_boxplot() +
  geom_point()

index

Assessment 11 (Data Visualization Principles, Part 3)

  1. Tile plot - measles and smallpox

The sample code given creates a tile plot showing the rate of measles cases per population. We are going to modify the tile plot to look at smallpox cases instead.

library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(dslabs)
data(us_contagious_diseases)
head(us_contagious_diseases)
   disease       state      year      weeks_reporting     count      population
   <fctr>        <fctr>     <dbl>     <int>               <dbl>      <dbl>
1  Hepatitis A   Alabama    1966      50              321        3345787
2  Hepatitis A   Alabama    1967      49              291        3364130
3  Hepatitis A   Alabama    1968      52              314        3386068
4  Hepatitis A   Alabama    1969      49              380        3412450
5  Hepatitis A   Alabama    1970      51                  413        3444165
6  Hepatitis A   Alabama    1971      51              378        3481798
6 rows
the_disease = "Measles"
dat <- us_contagious_diseases %>% 
   filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) %>% 
   mutate(rate = count / population * 10000) %>% 
   mutate(state = reorder(state, rate))

dat %>% ggplot(aes(year, state, fill = rate)) + 
  geom_tile(color = "grey50") + 
  scale_x_continuous(expand=c(0,0)) + 
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") + 
  theme_minimal() + 
  theme(panel.grid = element_blank()) + 
  ggtitle(the_disease) + 
  ylab("") + 
  xlab("")

index

library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(dslabs)
data(us_contagious_diseases)
head(us_contagious_diseases)
   disease       state      year      weeks_reporting     count      population
   <fctr>        <fctr>     <dbl>     <int>               <dbl>      <dbl>
1  Hepatitis A   Alabama    1966      50              321        3345787
2  Hepatitis A   Alabama    1967      49              291        3364130
3  Hepatitis A   Alabama    1968      52              314        3386068
4  Hepatitis A   Alabama    1969      49              380        3412450
5  Hepatitis A   Alabama    1970      51              413        3444165
6  Hepatitis A   Alabama    1971      51              378        3481798
6 rows
the_disease = "Smallpox"
dat <- us_contagious_diseases %>% 
   filter(!state%in%c("Hawaii","Alaska") & disease == the_disease & !weeks_reporting<10) %>% 
   mutate(rate = count / population * 10000) %>% 
   mutate(state = reorder(state, rate))

dat %>% ggplot(aes(year, state, fill = rate)) + 
  geom_tile(color = "grey50") + 
  scale_x_continuous(expand=c(0,0)) + 
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") + 
  theme_minimal() + 
  theme(panel.grid = element_blank()) + 
  ggtitle(the_disease) + 
  ylab("") + 
  xlab("")

index

  1. Time series plot - measles and smallpox

The sample code given creates a time series plot showing the rate of measles cases per population by state. We are going to again modify this plot to look at smallpox cases instead.

library(dplyr)
library(ggplot2)
library(dslabs)
library(RColorBrewer)
data(us_contagious_diseases)

the_disease = "Measles"
dat <- us_contagious_diseases %>%
   filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) %>%
   mutate(rate = count / population * 10000) %>%
   mutate(state = reorder(state, rate))
str(dat)
## 'data.frame':    3724 obs. of  7 variables:
##  $ disease        : Factor w/ 7 levels "Hepatitis A",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ state          : Factor w/ 51 levels "Mississippi",..: 9 9 9 9 9 9 9 9 9 9 ...
##   ..- attr(*, "scores")= num [1:51(1d)] 9.27 NA 24.15 9.37 19.16 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ year           : num  1928 1929 1930 1931 1932 ...
##  $ weeks_reporting: int  52 49 52 49 41 51 52 49 40 49 ...
##  $ count          : num  8843 2959 4156 8934 270 ...
##  $ population     : num  2589923 2619131 2646248 2670818 2693027 ...
##  $ rate           : num  34.1 11.3 15.7 33.5 1 ...
avg <- us_contagious_diseases %>%
  filter(disease==the_disease) %>% group_by(year) %>%
  summarize(us_rate = sum(count, na.rm=TRUE)/sum(population, na.rm=TRUE)*10000)

dat %>% ggplot() +
  geom_line(aes(year, rate, group = state),  color = "grey50", 
            show.legend = FALSE, alpha = 0.2, size = 1) +
  geom_line(mapping = aes(year, us_rate),  data = avg, size = 1, color = "black") +
  scale_y_continuous(trans = "sqrt", breaks = c(5,25,125,300)) + 
  ggtitle("Cases per 10,000 by state") + 
  xlab("") + 
  ylab("") +
  geom_text(data = data.frame(x=1955, y=50), mapping = aes(x, y, label="US average"), color="black") + 
  geom_vline(xintercept=1963, col = "blue")

index

library(dplyr)
library(ggplot2)
library(dslabs)
library(RColorBrewer)
data(us_contagious_diseases)

the_disease = "Smallpox"
dat <- us_contagious_diseases %>%
   filter(!state%in%c("Hawaii","Alaska") & disease == the_disease & !weeks_reporting<10) %>%
   mutate(rate = count / population * 10000) %>%
   mutate(state = reorder(state, rate))
str(dat)
## 'data.frame':    1014 obs. of  7 variables:
##  $ disease        : Factor w/ 7 levels "Hepatitis A",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ state          : Factor w/ 51 levels "Rhode Island",..: 17 17 17 17 17 17 17 17 17 17 ...
##   ..- attr(*, "scores")= num [1:51(1d)] 0.382 NA 2.011 0.805 0.924 ...
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ year           : num  1928 1929 1930 1931 1932 ...
##  $ weeks_reporting: int  51 52 52 52 52 52 52 52 51 52 ...
##  $ count          : num  341 378 192 295 467 82 23 42 12 54 ...
##  $ population     : num  2589923 2619131 2646248 2670818 2693027 ...
##  $ rate           : num  1.317 1.443 0.726 1.105 1.734 ...
avg <- us_contagious_diseases %>%
  filter(disease==the_disease) %>% group_by(year) %>%
  summarize(us_rate = sum(count, na.rm=TRUE)/sum(population, na.rm=TRUE)*10000)

dat %>% ggplot() +
  geom_line(aes(year, rate, group = state),  color = "grey50", 
            show.legend = FALSE, alpha = 0.2, size = 1) +
  geom_line(mapping = aes(year, us_rate),  data = avg, size = 1, color = "black") +
  scale_y_continuous(trans = "sqrt", breaks = c(5,25,125,300)) + 
  ggtitle("Cases per 10,000 by state") + 
  xlab("") + 
  ylab("") +
  geom_text(data = data.frame(x=1955, y=50), mapping = aes(x, y, label="US average"), color="black") + 
  geom_vline(xintercept=1963, col = "blue")

index

  1. Time series plot - all diseases in California

Now we are going to look at the rates of all diseases in one state. Again, you will be modifying the sample code to produce the desired plot. - For the state of California, make a time series plot showing rates for all diseases. - Include only years with 10 or more weeks reporting. - Use a different color for each disease.

library(dplyr)
library(ggplot2)
library(dslabs)
library(RColorBrewer)
data(us_contagious_diseases)

us_contagious_diseases %>% filter(state=="California" & !weeks_reporting<10) %>% 
  group_by(year, disease) %>%
  summarize(rate = sum(count)/sum(population)*10000) %>%
  ggplot(aes(year, rate,color=disease)) + 
  geom_line()

index

  1. Time series plot - all diseases in the United States

Now we are going to make a time series plot for the rates of all diseases in the United States. For this exercise, we have provided less sample code - you can take a look at the previous exercise to get you started. - Compute the US rate by using summarize to sum over states. - The US rate for each disease will be the total number of cases divided by the total population. - Remember to convert to cases per 10,000. - You will need to filter for !is.na(population) to get all the data. - Plot each disease in a different color.

library(dplyr)
library(ggplot2)
library(dslabs)
library(RColorBrewer)
data(us_contagious_diseases)

us_contagious_diseases %>% filter(!is.na(population)) %>% 
  group_by(year, disease) %>%
  summarize(rate=sum(count)/sum(population)*10000) %>%
  ggplot(aes(year, rate,color=disease)) + geom_line()

index

Properties of Stars Exercises

Background
Astronomy is one of the oldest data-driven sciences. In the late 1800s, the director of the Harvard College Observatory hired women to analyze astronomical data, which at the time was done using photographic glass plates. These women became known as the Harvard Computers. They computed the position and luminosity of various astronomical objects such as stars and galaxies. (If you are interested, you can learn more about the Harvard Computers). Today, astronomy is even more of a data-driven science, with an inordinate amount of data being produced by modern instruments every day.

In the following exercises we will analyze some actual astronomical data to inspect properties of stars, their absolute magnitude (which relates to a star’s luminosity, or brightness), temperature and type (spectral class).

Libraries and Options

#update.packages()
library(tidyverse)
library(dslabs)
data(stars)
options(digits = 3)   # report 3 significant digits

Question 1
Load the stars data frame from dslabs. This contains the name, absolute magnitude, temperature in degrees Kelvin, and spectral class of selected stars. Absolute magnitude (shortened in these problems to simply “magnitude”) is a function of star luminosity, where negative values of magnitude have higher luminosity.

# What is the mean magnitude?
mean(stars$magnitude)
## [1] 4.26
# What is the standard deviation of magnitude?
sd(stars$magnitude)
## [1] 7.35

Question 2
Make a density plot of the magnitude.

stars %>%
  ggplot(aes(magnitude)) +
  geom_density()

# How many peaks are there in the data?
# A: 2

Question 3
Examine the distribution of star temperature. Which of these statements best characterizes the temperature distribution?

stars %>%
  ggplot(aes(temp)) +
  geom_density()

# How many peaks are there in the data?
# A: 2

Question 4
Make a scatter plot of the data with temperature on the x-axis and magnitude on the y-axis and examine the relationship between the variables. Recall that lower magnitude means a more luminous (brighter) star.

stars %>%
  ggplot(aes(x=temp, y=magnitude)) +
  geom_point()

Question 5
For various reasons, scientists do not always follow straight conventions when making plots, and astronomers usually transform values of star luminosity and temperature before plotting. Flip the y-axis so that lower values of magnitude are at the top of the axis (recall that more luminous stars have lower magnitude) using scale_y_reverse. Take the log base 10 of temperature and then also flip the x-axis.
Fill in the blanks in the statements below to describe the resulting plot:
The brighest, highest temperature stars are in the ______________ corner of the plot.

stars %>%
  ggplot(aes(x=log10(temp), y=magnitude)) +
  scale_y_reverse() +
  scale_x_reverse() +
  geom_point()

Question 6
The trends you see allow scientists to learn about the evolution and lifetime of stars. The primary group of stars to which most stars belong (see question 4) we will call the main sequence stars. Most stars belong to this main sequence, however some of the more rare stars are classified as old and evolved stars. These stars tend to be hotter stars, but also have low luminosity, and are known as white dwarfs.

How many white dwarfs are there in our sample?
A: 4

Question 7
Consider stars which are not part of the Main Group but are not old/evolved (white dwarf) stars. These stars must also be unique in certain ways and are known as giants. Use the plot from Question 5 to estimate the average temperature of a giant.

Which of these temperatures is closest to the average temperature of a giant?: A: 5000K

Question 8
We can now identify whether specific stars are main sequence stars, red giants or white dwarfs. Add text labels to the plot to answer these questions. You may wish to plot only a selection of the labels, repel the labels, or zoom in on the plot in RStudio so you can locate specific stars.
Fill in the blanks in the statements below:

library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.0.2
stars %>%
  ggplot(aes(x=log10(temp), y=magnitude, label=star)) +
  scale_y_reverse() +
  scale_x_reverse() +
  geom_point() +
  geom_text(aes(label=star)) +
  geom_text_repel()

# The least lumninous star in the sample with a surface temperature over 5000K is _________.
# A: van Maanens Star
# The two stars with lowest temperature and highest luminosity are known as supergiants. The two supergiants in this dataset are ____________.
# A: Betelgeuse and Antares
# The Sun is a ______________.
# A: main sequence star
stars %>% 
  filter(star=='Sun') %>%
  select_all()
##   star magnitude temp type
## 1  Sun       4.8 5840    G

Question 9
Remove the text labels and color the points by star type. This classification describes the properties of the star’s spectrum, the amount of light produced at various wavelengths.

stars %>%
  ggplot(aes(x=log10(temp), y=magnitude, color=type)) +
  scale_y_reverse() +
  scale_x_reverse() +
  geom_point()

# Which star type has the lowest temperature?

Climate Change Exercises

Background

The planet’s surface temperature is increasing due to human greenhouse gas emissions, and this global warming and carbon cycle disruption is wreaking havoc on natural systems. Living systems that depend on current temperature, weather, currents and carbon balance are jeopardized, and human society will be forced to contend with widespread economic, social, political and environmental damage as the temperature continues to rise. Although most countries recognize that global warming is a crisis and that humans must act to limit its effects, little action has been taken to limit or reverse human impact on the climate.

One limitation is the spread of misinformation related to climate change and its causes, especially the extent to which humans have contributed to global warming. In these exercises, we examine the relationship between global temperature changes, greenhouse gases and human carbon emissions using time series of actual atmospheric and ice core measurements from the National Oceanic and Atmospheric Administration (NOAA) and Carbon Dioxide Information Analysis Center (CDIAC).

Libraries and Options

#update.packages()
library(tidyverse)
library(dslabs)
data(temp_carbon)
data(greenhouse_gases)
data(historic_co2)

Question 1
Load the temp_carbon dataset from dslabs, which contains annual global temperature anomalies (difference from 20th century mean temperature in degrees Celsius), temperature anomalies over the land and ocean, and global carbon emissions (in metric tons). Note that the date ranges differ for temperature and carbon emissions.

Which of these code blocks return the latest year for which carbon emissions are reported?

str(temp_carbon)
## 'data.frame':    268 obs. of  5 variables:
##  $ year            : num  1880 1881 1882 1883 1884 ...
##  $ temp_anomaly    : num  -0.11 -0.08 -0.1 -0.18 -0.26 -0.25 -0.24 -0.28 -0.13 -0.09 ...
##  $ land_anomaly    : num  -0.48 -0.4 -0.48 -0.66 -0.69 -0.56 -0.51 -0.47 -0.41 -0.31 ...
##  $ ocean_anomaly   : num  -0.01 0.01 0 -0.04 -0.14 -0.17 -0.17 -0.23 -0.05 -0.02 ...
##  $ carbon_emissions: num  236 243 256 272 275 277 281 295 327 327 ...
temp_carbon %>%
    .$year %>%
    max()
## [1] 2018
temp_carbon %>%
    filter(!is.na(carbon_emissions)) %>%
    pull(year) %>%
    max()
## [1] 2014
#temp_carbon %>%
#    filter(!is.na(carbon_emissions)) %>%
#    max(year)
temp_carbon %>%
    filter(!is.na(carbon_emissions)) %>%
    .$year %>%
    max()
## [1] 2014
temp_carbon %>%
    filter(!is.na(carbon_emissions)) %>%
    select(year) %>%
    max()
## [1] 2014
#temp_carbon %>%
#    filter(!is.na(carbon_emissions)) %>%
#    max(.$year)

Question 2
Inspect the difference in carbon emissions in temp_carbon from the first available year to the last available year.

# What is the first year for which carbon emissions (carbon_emissions) data are available?
year_min <- temp_carbon %>%
  filter(!is.na(carbon_emissions)) %>%
  .$year %>%
  min()
# What is the last year for which carbon emissions data are available?
year_max <- temp_carbon %>%
  filter(!is.na(carbon_emissions)) %>%
  .$year %>%
  max()
# How many times larger were carbon emissions in the last year relative to the first year?
ratio <- temp_carbon %>%
  filter(year %in% c(year_min, year_max)) %>%
  .$carbon_emissions
#A:
ratio[1] / ratio[2]
## [1] 3285
# Scatter plot
temp_carbon %>%
  filter(!is.na(carbon_emissions)) %>%
  ggplot(aes(x=year, y=carbon_emissions)) +
  geom_point()

Question 3
Inspect the difference in temperature in temp_carbon from the first available year to the last available year.

# What is the first year for which global temperature anomaly (temp_anomaly) data are available?
year_min <- temp_carbon %>%
  filter(!is.na(temp_anomaly)) %>%
  .$year %>%
  min()
year_min
## [1] 1880
# What is the last year for which global temperature anomaly data are available?
year_max <- temp_carbon %>%
  filter(!is.na(temp_anomaly)) %>%
  .$year %>%
  max()
year_max
## [1] 2018
# How many degrees Celsius has temperature increased over the date range?
diff <- temp_carbon %>%
  filter(year %in% c(year_min, year_max)) %>%
  .$temp_anomaly
#A:
diff
## [1] -0.11  0.82
diff[1] - diff[2]
## [1] -0.93

Question 4 Create a time series line plot of the temperature anomaly. Only include years where temperatures are reported. Save this plot to the object p.
Which command adds a blue horizontal line indicating the 20th century mean temperature?

p <- temp_carbon %>%
  filter(!is.na(temp_anomaly)) %>%
  ggplot(aes(year, temp_anomaly)) +
  geom_line() + 
  geom_hline(aes(yintercept=0), color='blue')
p

Question 5
Continue working with p, the plot created in the previous question.

Change the y-axis label to be “Temperature anomaly (degrees C)”. Add a title, “Temperature anomaly relative to 20th century mean, 1880-2018”. Also add a text layer to the plot: the x-coordinate should be 2000, the y-coordinate should be 0.05, the text should be “20th century mean”, and the text color should be blue.

q <- temp_carbon %>%
  filter(!is.na(temp_anomaly)) %>%
  ggplot(aes(year, temp_anomaly)) +
  geom_line() + 
  geom_hline(aes(yintercept=0), color='blue') +
  ylab("Temperature anomaly (degrees C)") +
  ggtitle("Temperature anomaly relative to 20th century mean, 1880-2018") +
  geom_text(aes(x=2000, y=0.05, label="20th century mean"), col='blue')
q

Question 6

When was the earliest year with a temperature above the 20th century mean?

year_min <- temp_carbon %>%
  filter(!is.na(temp_anomaly) & temp_anomaly>0) %>%
  .$year %>%
  min()
year_min
## [1] 1939

When was the last year with an average temperature below the 20th century mean?

year_max <- temp_carbon %>%
  filter(!is.na(temp_anomaly) & temp_anomaly<0) %>%
  .$year %>%
  max()
year_max
## [1] 1976

In what year did the temperature anomaly exceed 0.5 degrees Celsius for the first time?

year_ <- temp_carbon %>%
  filter(!is.na(temp_anomaly) & temp_anomaly>0.5) %>%
  .$year %>%
  min()
year_
## [1] 1997

Question 7 Add layers to the previous plot to include line graphs of the temperature anomaly in the ocean (ocean_anomaly) and on land (land_anomaly). Assign different colors to the lines. Compare the global temperature anomaly to the land temperature anomaly and ocean temperature anomaly.

Which region has the largest 2018 temperature anomaly relative to the 20th century mean?

temp_carbon %>%
  filter(!is.na(temp_anomaly)) %>%
  ggplot(aes(year, temp_anomaly)) +
  geom_line(col='red') + 
  geom_hline(aes(yintercept=0), color='blue') +
  xlim(c(1880, 2018)) +
  ylab("Temperature anomaly (degrees C)") +
  ggtitle("Temperature anomaly relative to 20th century mean, 1880-2018") +
  geom_text(aes(x=2000, y=0.05, label="20th century mean"), col='blue') +
  geom_line(aes(year, ocean_anomaly), col='cyan') +
  geom_line(aes(year, land_anomaly), col='green')

Question 8 A major determinant of Earth’s temperature is the greenhouse effect. Many gases trap heat and reflect it towards the surface, preventing heat from escaping the atmosphere. The greenhouse effect is vital in keeping Earth at a warm enough temperature to sustain liquid water and life; however, changes in greenhouse gas levels can alter the temperature balance of the planet.

The greenhouse_gases data frame from dslabs contains concentrations of the three most significant greenhouse gases: carbon dioxide ( CO2 , abbreviated in the data as co2), methane ( CH4 , ch4 in the data), and nitrous oxide ( N2O , n2o in the data). Measurements are provided every 20 years for the past 2000 years.

str(greenhouse_gases)
## 'data.frame':    300 obs. of  3 variables:
##  $ year         : num  20 40 60 80 100 120 140 160 180 200 ...
##  $ gas          : chr  "CO2" "CO2" "CO2" "CO2" ...
##  $ concentration: num  278 278 277 277 278 ...

Complete the code outline below to make a line plot of concentration on the y-axis by year on the x-axis. Facet by gas, aligning the plots vertically so as to ease comparisons along the year axis. Add a vertical line with an x-intercept at the year 1850, noting the unofficial start of the industrial revolution and widespread fossil fuel consumption. Note that the units for ch4 and n2o are ppb while the units for co2 are ppm.

greenhouse_gases %>%
    ggplot(aes(year, concentration)) +
    geom_line() +
    facet_grid(gas ~ ., scales = "free") +
    geom_vline(xintercept = 1850, col='red') +
    ylab("Concentration (ch4/n2o ppb, co2 ppm)") +
    ggtitle("Atmospheric greenhouse gas concentration by year, 0-2000")

Question 10 Make a time series line plot of carbon emissions (carbon_emissions) from the temp_carbon dataset. The y-axis is metric tons of carbon emitted per year.

temp_carbon %>%
  filter(!is.na(carbon_emissions)) %>%
  ggplot(aes(year, carbon_emissions)) +
  geom_line()

Question 11
We saw how greenhouse gases have changed over the course of human history, but how has CO2 (co2 in the data) varied over a longer time scale? The historic_co2 data frame in dslabs contains direct measurements of atmospheric co2 from Mauna Loa since 1959 as well as indirect measurements of atmospheric co2 from ice cores dating back 800,000 years.

Make a line plot of co2 concentration over time (year), coloring by the measurement source (source). Save this plot as co2_time for later use.

co2_time <- historic_co2 %>%
  filter(!is.na(co2)) %>%
  ggplot(aes(year, co2, col=source)) +
  geom_line() +
  ggtitle("Atmospheric CO2 concentration, -800,000 BC to today") +
  ylab("co2 (ppmv)")
co2_time

Question 12
One way to differentiate natural co2 oscillations from today’s manmade co2 spike is by examining the rate of change of co2. The planet is affected not only by the absolute concentration of co2 but also by its rate of change. When the rate of change is slow, living and nonliving systems have time to adapt to new temperature and gas levels, but when the rate of change is fast, abrupt differences can overwhelm natural systems. How does the pace of natural co2 change differ from the current rate of change?

Use the co2_time plot saved above. Change the limits as directed to investigate the rate of change in co2 over various periods with spikes in co2 concentration.